Referring video object segmentation (RVOS) aims to segment, in each frame of a given video, the object that a referring expression describes. Previous methods adopt multi-stage approaches and design complex pipelines to obtain promising results. Recently, end-to-end Transformer-based methods have proved superior. In this work, we draw on the advantages of both to provide a simple and effective pipeline for RVOS. First, we improve the state-of-the-art one-stage method ReferFormer to obtain mask sequences that are strongly correlated with the language descriptions. Second, starting from a reliable, high-quality keyframe, we leverage the strong performance of a video object segmentation model to further enhance the quality and temporal consistency of the masks. Our single model reaches 70.3 J&F on the Referring Youtube-VOS validation set and 63.0 on the test set. After ensembling, we achieve 64.1 on the final leaderboard, ranking first in the CVPR 2022 Referring Youtube-VOS challenge. Code will be available at https://github.com/Zhiweihhh/cvpr2022-rvos-challenge.git.
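For context, the J&F score reported above averages region similarity J (the intersection-over-union of the predicted and ground-truth masks) with contour accuracy F. The following is a minimal sketch of J for a single frame, assuming binary masks given as nested lists. The function name `region_similarity` and the empty-mask convention are illustrative assumptions, not taken from the challenge toolkit.

```python
def region_similarity(pred, gt):
    """Jaccard index (IoU) between two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0   # pixel set in both masks
            union += 1 if (p or g) else 0    # pixel set in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union
```

A video-level J would typically average this value over frames and objects; F requires boundary extraction and is omitted here.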
Referring image segmentation aims to segment the target object described by a given natural language expression. Typically, referring expressions contain complex relationships between the target and its surrounding objects. The main challenge of this task is to understand the visual and linguistic content simultaneously and to find the referred object accurately among all instances in the image. Currently, the most effective way to solve this problem is to obtain aligned multi-modal features by computing the correlation between the visual and linguistic modalities under the supervision of the ground-truth mask. However, existing paradigms have difficulty thoroughly understanding visual and linguistic content because they cannot directly perceive information about the surrounding objects that the expression uses to refer to the target. This prevents them from learning aligned multi-modal features, which leads to inaccurate segmentation. To address this issue, we present a position-aware contrastive alignment network (PCAN) that enhances the alignment of multi-modal features by guiding the interaction between vision and language through prior position information. Our PCAN consists of two modules: 1) a Position Aware Module (PAM), which provides position information for all objects related to the natural language description, and 2) a Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment by comparing the features of the referred object with those of related objects. Extensive experiments on three benchmarks demonstrate that PCAN performs favorably against state-of-the-art methods. Our code will be made publicly available.
Convolutional neural networks (CNNs) have achieved remarkable success, but they typically come with high computational cost and many redundant weight parameters. To reduce FLOPs, structured pruning is a popular approach that removes entire hidden structures by introducing coarse-grained sparsity. Meanwhile, many pruning works instead leverage fine-grained sparsity (where the zeros are randomly distributed), but their sparse models lack a specially designed computing library that could turn the sparsity into a speedup. In this technical report, we study and present an efficient CNN inference system that accelerates the forward pass by exploiting the fine-grained sparsity of compressed CNNs. Our system, FSCNN, is built on a set of specially designed sparse data structures, operators, and associated algorithms. Experimentally, we validate that FSCNN outperforms the standard deep learning library PyTorch on popular CNN architectures such as VGG16 when the sparsity is sufficiently high. However, due to the contiguity issue of sparse operators, FSCNN is typically not competitive with highly optimized dense operators. We therefore recommend coarse-grained (structured) sparsity for generic model compression.
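To illustrate how fine-grained sparsity can be exploited at inference time, here is a minimal sketch of a compressed sparse row (CSR) layout and a matrix-vector product that skips zero weights. CSR is a generic format chosen for illustration; the report does not say that FSCNN uses this exact layout, and the helper names `dense_to_csr` and `csr_matvec` are hypothetical.

```python
def dense_to_csr(matrix):
    """Convert a dense row-major matrix into CSR arrays (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, w in enumerate(row):
            if w != 0:
                values.append(w)      # keep only the non-zero weights
                col_idx.append(j)     # remember the column each one came from
        row_ptr.append(len(values))   # where the next row starts in `values`
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = W @ x while touching only the stored non-zero weights."""
    out = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect read of x: this scattered access pattern is one reason
            # sparse kernels are harder to optimize than dense ones.
            acc += values[k] * x[col_idx[k]]
        out.append(acc)
    return out
```

The work done is proportional to the number of non-zero weights, which is why a benefit appears only at high sparsity, while the indirect indexing gives a rough sense of the non-contiguous memory access that dense operators avoid.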
Survival analysis on histological whole-slide images (WSIs) is one of the most important means of estimating patient prognosis. Although many weakly supervised deep learning models have been developed for gigapixel WSIs, their potential is generally restricted by classical survival analysis rules and the requirement of full supervision. As a result, these models provide patients only with a fully certain point estimate of time-to-event, and they can learn only from well-annotated WSI data, which is currently available only at a small scale. To tackle these problems, we propose a novel adversarial multiple instance learning (AdvMIL) framework. The framework is based on adversarial time-to-event modeling and integrates the multiple instance learning (MIL) that is necessary for WSI representation learning. It is plug-and-play, so most existing WSI-based models with embedding-level MIL networks can be easily upgraded by applying it, gaining the ability to estimate survival distributions and to learn in a semi-supervised manner. Our extensive experiments show that AdvMIL not only improves the performance of mainstream WSI models at a relatively low computational cost, but also enables these models to learn from unlabeled data through semi-supervised learning. With its novel paradigm of adversarial MIL, AdvMIL could promote research on time-to-event modeling in computational pathology.
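Customer reviews usually contain a great deal of information about a person's online shopping experience. While positive reviews benefit stores, negative reviews strongly influence consumers' decisions and may lead to a decline in sales. It is therefore vital to reply to each negative review carefully and persuasively and to minimize its adverse effect. Recent studies consider leveraging generation models to help sellers respond. However, the problem has not been well addressed, because a review may raise multiple aspects of an issue, each of which should be resolved accordingly and persuasively. In this work, we propose a multi-source, multi-aspect generation model for persuasive response generation. The proposed model appropriately obtains and leverages various sources of information to produce more informative and persuasive responses. A multi-aspect attentive network is proposed to automatically attend to the different aspects in a review and ensure that most of the issues are addressed. Extensive experiments on two real-world datasets show that our approach outperforms state-of-the-art methods, and online tests show that our deployed system significantly improves the efficiency with which stores handle negative reviews.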
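Respiratory disorders such as sleep apnea are serious illnesses that affect a large number of individuals. They arise from a deficient capacity of the lungs to contain and exchange oxygen and carbon dioxide so as to keep the body in a stable state of homeostasis. Respiratory measurements such as minute ventilation can be used together with other physiological measurements, such as heart rate and heart rate variability, to monitor health remotely and to detect symptoms of such breathing-related disorders. In this work, we formulate a deep-learning-based approach to measuring minute ventilation remotely on a private dataset. The dataset will be made public upon acceptance of this work. We use two versions of a deep neural network to estimate minute ventilation from data streams obtained through wearable heart rate and respiratory devices. We demonstrate that the simple design of our pipeline, which includes lightweight deep neural networks, can be easily incorporated into real-time health monitoring systems.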
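In image super-resolution, both pixel-wise accuracy and perceptual fidelity are desirable. However, most deep learning methods achieve high performance in only one of the two because of the perception-distortion trade-off, and the works that do balance the trade-off rely on fusing results from separately trained models with ad hoc post-processing. In this paper, we propose a novel super-resolution model with a low-frequency constraint (LFC-SR), which balances objective and perceptual quality within a single model and yields super-resolved images with high PSNR and perceptual scores. We further introduce an ADMM-based alternating optimization method for the non-trivial learning of the constrained model. Experiments show that our method, without cumbersome post-processing procedures, achieves state-of-the-art performance. The code is available at https://github.com/yuehan717/pdasr.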
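In this paper, we study the problem of global reward maximization with only partial distributed feedback. This problem is motivated by several real-world applications (e.g., cellular network configuration, dynamic pricing, and policy selection) in which an action taken by a central entity influences a large population that contributes to the global reward. However, collecting such reward feedback from the entire population not only incurs a prohibitively high cost but also often raises privacy concerns. To tackle this problem, we consider differentially private distributed linear bandits, in which only a subset of users from the population (called clients) is selected to participate in the learning process, and the central server learns the global model from this partial feedback by iteratively aggregating the clients' local feedback in a differentially private fashion. We then propose a unified algorithmic learning framework, called differentially private distributed phased elimination (DP-DPE), which can be naturally integrated with popular differential privacy (DP) models, including central DP, local DP, and shuffle DP. Furthermore, we prove that DP-DPE achieves both sublinear regret and sublinear communication cost. Interestingly, DP-DPE also achieves privacy protection "for free", in the sense that the additional cost due to the privacy guarantee is a lower-order additive term. In addition, as a by-product of our techniques, the same "free" privacy result can also be achieved for standard differentially private linear bandits. Finally, we conduct simulations to corroborate our theoretical results and demonstrate the effectiveness of DP-DPE.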
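As a rough illustration of aggregating client feedback in a differentially private fashion, the sketch below clips each client's scalar feedback, averages it on the server, and adds Gaussian noise calibrated with the classic Gaussian mechanism (which assumes 0 < epsilon < 1). This is a generic central-DP example under stated assumptions, not the DP-DPE algorithm itself; the names `gaussian_sigma` and `private_mean` and the scalar-feedback setting are hypothetical.

```python
import math
import random


def gaussian_sigma(sensitivity, epsilon, delta):
    """Noise scale of the classic Gaussian mechanism (assumes 0 < epsilon < 1)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def private_mean(client_values, bound, epsilon, delta, rng):
    """Average clipped client feedback and add calibrated Gaussian noise."""
    # Clip each value to [-bound, bound] so one client has bounded influence.
    clipped = [max(-bound, min(bound, v)) for v in client_values]
    n = len(clipped)
    # Replacing one client's value changes the mean by at most 2 * bound / n.
    sensitivity = 2.0 * bound / n
    sigma = gaussian_sigma(sensitivity, epsilon, delta)
    return sum(clipped) / n + rng.gauss(0.0, sigma)


# Example call with a seeded generator for reproducibility.
estimate = private_mean([0.2, 0.9, -0.4, 1.7], bound=1.0,
                        epsilon=0.5, delta=1e-5, rng=random.Random(0))
```

Because the noise scale shrinks as the number of participating clients grows, the privacy cost of an averaged statistic becomes small relative to the signal, which gives some intuition for why privacy can end up as a lower-order term.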
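Cancer prognosis on gigapixel whole-slide images (WSIs) has always been a challenging task. Most existing approaches focus solely on single-resolution images. Multi-resolution schemes, which use image pyramids to enhance WSI visual representations, have not yet received sufficient attention. To explore a multi-resolution solution for improving cancer prognosis accuracy, this paper proposes a dual-stream architecture that models WSIs with an image pyramid strategy. The architecture consists of two sub-streams: one for low-resolution WSIs and the other for high-resolution WSIs. Compared with other approaches, our scheme has three highlights: (i) there is a one-to-one relation between stream and resolution; (ii) a square pooling layer is added to align the patches of the two resolution streams, which greatly reduces computational cost and enables natural fusion of stream features; (iii) a cross-attention-based method is proposed to pool high-resolution patches spatially under the guidance of low-resolution ones. We validate our scheme on three publicly available datasets with a total of 3,101 WSIs from 1,911 patients. Experimental results verify that (1) the hierarchical dual-stream representation is more effective for cancer prognosis than single-stream ones, with average C-index gains of 5.0% and 1.8% over a single low-resolution and a single high-resolution stream, respectively; (2) our dual-stream scheme can outperform current state-of-the-art schemes, with an average C-index improvement of 5.1%; (3) cancers with observable survival differences may have different preferences for model complexity. Our scheme could serve as an alternative tool for further facilitating WSI prognosis research.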
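The C-index used to report these results measures how often a model ranks patients correctly by risk. Below is a minimal sketch of Harrell's concordance index, assuming higher predicted risk should correspond to shorter survival. The function name and the tie handling (risk ties count as half) are common conventions adopted here for illustration, not details taken from the paper.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index over all comparable patient pairs."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i has an observed event
            # strictly before patient j's event or censoring time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # earlier event received higher risk
                elif risks[i] == risks[j]:
                    concordant += 0.5   # tied risks count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 1.0 means every comparable pair is ranked correctly, and 0.5 corresponds to random ranking, so gains of a few points are meaningful at this scale.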
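Multi-exit architectures consist of a backbone and branch classifiers that offer shortened inference pathways to reduce the run-time of deep neural networks. In this paper, we analyze different branching patterns that vary in how they allocate computational complexity to the branch classifiers. Constant-complexity branching keeps all branches the same, while complexity-increasing and complexity-decreasing branching place the more complex branches later or earlier in the backbone, respectively. Through extensive experiments on multiple backbones and datasets, we find that complexity-decreasing branching is more effective than constant-complexity or complexity-increasing branching and achieves the best accuracy-cost trade-off. We investigate the cause by using knowledge consistency to probe the effect of adding branches to a backbone. Our findings show that complexity-decreasing branching causes the least disruption to the feature abstraction hierarchy of the backbone, which explains the effectiveness of this branching pattern.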
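To make the shortened inference pathway concrete, here is a minimal sketch of early-exit inference: the backbone runs stage by stage, and prediction stops at the first branch whose softmax confidence clears a threshold. The confidence-threshold exit rule and all names (`early_exit_predict`, `stages`, `branches`) are assumptions for illustration; the abstract does not specify the exit policy.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(x, stages, branches, threshold=0.9):
    """Return (label, exit_index) from the first sufficiently confident branch."""
    if not stages or len(stages) != len(branches):
        raise ValueError("need one branch per backbone stage")
    h = x
    for idx, (stage, branch) in enumerate(zip(stages, branches)):
        h = stage(h)                     # run the next backbone stage
        probs = softmax(branch(h))       # classify from the intermediate features
        is_last = idx == len(stages) - 1
        # Exit early when confident; the final branch always answers.
        if max(probs) >= threshold or is_last:
            return probs.index(max(probs)), idx
```

Under this rule, easy inputs leave at early branches and skip the remaining stages, so the cost of each branch and where it sits on the backbone directly shape the accuracy-cost trade-off.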